Project 2 - Data Cubing and Binning

In this post we apply binning to pre-process the dataset.

Author: renan
Affiliation: Master of Data Science Program @ The University of West Florida (UWF)

For Project 2 we further process our dataset by leveraging binning and a data cube structure built on hierarchies.

1. Introduction

Analyzing the State_time_series.csv dataset, which contains granular real estate metrics over time, benefits significantly from data transformation and aggregation. Binning is essential for this process, as it converts continuous variables, such as MedianListingPrice_AllHomes, into discrete, manageable categories (e.g., ‘$150k-$200k’, ‘$200k-$250k’). This discretization simplifies complex data, making it easier to summarize, visualize, and identify trends.

Furthermore, leveraging a data cube structure built on hierarchies allows for powerful, multidimensional analysis. For instance, the Date field isn’t just a single point in time; it’s part of a hierarchy that can be “rolled up” from a specific day to a Month, Quarter, or Year. Similarly, the RegionName (State) could be aggregated into broader geographical regions (e.g., “Northeast”, “West Coast”). By combining these binned and hierarchical dimensions, we can quickly “slice and dice” the data to answer complex questions, such as “How many homes in the $200k-$250k price bin were available in the Northeast region during Q3 2018?” This turns a massive, raw dataset into a flexible tool for gaining actionable insights.
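The state-to-region roll-up is not present in the dataset, so a mapping table has to be supplied. As a minimal sketch of the "slice and dice" question above (the `region_map` and `toy` tables and the `Quarter`/`Price_Bin` column names here are invented for illustration; the real binned columns are built later in this post):

```r
library(dplyr)

# Hypothetical state -> region roll-up (only a few states shown)
region_map <- tibble(
  RegionName = c("Maine", "Vermont", "California", "Oregon"),
  Region     = c("Northeast", "Northeast", "West Coast", "West Coast")
)

# Toy fact table standing in for the binned dataset
toy <- tibble(
  RegionName = c("Maine", "Vermont", "California", "Oregon"),
  Quarter    = c(2018.3, 2018.3, 2018.3, 2018.3),
  Price_Bin  = c("$200k-$250k", "$200k-$250k", "Over $500k", "$200k-$250k")
)

# "Slice" on quarter and price bin, then "roll up" states to regions
bin_counts <- toy %>%
  left_join(region_map, by = "RegionName") %>%
  filter(Quarter == 2018.3, Price_Bin == "$200k-$250k") %>%
  count(Region, name = "Observations")

bin_counts
```

The same pattern, applied to the real `processed_data` built in Section 3, answers the Q3 2018 question directly.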

The Zillow Home Value Index or ZHVI is a smoothed, seasonally adjusted measure of the typical home value and market changes across a given region and housing type. It reflects the typical value for homes in the 35th to 65th percentile range.

The Zillow Economics Data dataset (the source of the ZHVI series) can be downloaded with:

```{bash}
#!/bin/bash
curl -L -o ~/Downloads/zecon.zip \
  https://www.kaggle.com/api/v1/datasets/download/zillow/zecon
```

1.1 Load packages

# install.packages("fpp3")
library(fpp3)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggtime)
library(patchwork)
library(DT)

Notes on possible warnings when attaching the packages:

```
Attaching package: ‘ggtime’

The following objects are masked from ‘package:feasts’:

    gg_arma, gg_irf, gg_lag, gg_season, gg_subseries, gg_tsdisplay,
    gg_tsresiduals
```

2. Load the Dataset

The following code locates the repository root, then builds the variable datasets_path, which you can use to construct the path to the data you want to load.

Code
find_git_root <- function(start = getwd()) {
  path <- normalizePath(start, winslash = "/", mustWork = TRUE)
  while (path != dirname(path)) {
    if (dir.exists(file.path(path, ".git"))) return(path)
    path <- dirname(path)
  }
  stop("No .git directory found — are you inside a Git repository?")
}

repo_root <- find_git_root()
datasets_path <- file.path(repo_root, "datasets")
zillow_economics_data_path <- file.path(datasets_path, "zillow-economics-data-01")

state_time_series <- file.path(zillow_economics_data_path, "State_time_series.csv")
all_states_data <- read.csv(state_time_series)

3. Data Exploration and Processing

The ‘Date’ column is a character and should be converted to a Date object.

all_states_data$Date <- as.Date(all_states_data$Date)
# str(all_states_data)

Exploring the N/As and the structure of the dataset:

# Get the total number of rows for calculating percentages
total_rows <- nrow(all_states_data)

# Create a pretty summary table
na_summary <- all_states_data %>%
  # 1. Count NAs for every column
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  
  # 2. Pivot the data from wide to long
  pivot_longer(everything(),
               names_to = "Column",
               values_to = "NA_Count") %>%
  
  # 3. (Optional) Filter to only show columns that HAVE NAs
  filter(NA_Count > 0) %>%
  
  # 4. (Optional) Add a percentage column
  mutate(NA_Percentage = (NA_Count / total_rows) * 100) %>%

  # 5. Sort by the highest NA count
  arrange(desc(NA_Count))

# Print the clean table
print(na_summary)
# A tibble: 80 × 3
   Column                                  NA_Count NA_Percentage
   <chr>                                      <int>         <dbl>
 1 PctOfHomesSellingForGain_AllHomes          12609          95.4
 2 PctOfHomesSellingForLoss_AllHomes          12609          95.4
 3 MedianRentalPrice_5BedroomOrMore           11994          90.8
 4 MedianRentalPricePerSqft_5BedroomOrMore    11752          88.9
 5 MedianRentalPricePerSqft_Studio            10875          82.3
 6 MedianRentalPrice_CondoCoop                10437          79.0
 7 MedianRentalPricePerSqft_DuplexTriplex     10293          77.9
 8 MedianRentalPrice_Studio                   10211          77.3
 9 MedianListingPrice_1Bedroom                10205          77.2
10 MedianRentalPrice_DuplexTriplex            10068          76.2
# ℹ 70 more rows

A sample of the dataset, rendered as a data table, makes it visually clear what the dataset looks like and which features are available.

print("all_states_data (tsibble):")
[1] "all_states_data (tsibble):"
DT::datatable(all_states_data, options = list(pageLength = 5))
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
library(skimr)
skim(all_states_data)
Data summary
Name all_states_data
Number of rows 13212
Number of columns 82
_______________________
Column type frequency:
character 1
Date 1
numeric 80
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
RegionName 0 1 4 18 0 52 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
Date 0 1 1996-04-30 2017-12-31 2007-05-31 261

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DaysOnZillow_AllHomes 8367 0.37 110.12 27.47 49.25 90.25 108.50 126.75 251.62 ▃▇▃▁▁
InventorySeasonallyAdjusted_AllHomes 8316 0.37 33292.66 34926.70 972.00 9828.50 21712.50 47453.00 260687.00 ▇▂▁▁▁
InventoryRaw_AllHomes 8316 0.37 33299.01 35014.16 911.00 9756.25 21289.00 46891.00 268055.00 ▇▂▁▁▁
MedianListingPricePerSqft_1Bedroom 9626 0.27 182.47 99.81 57.14 125.69 162.75 202.63 627.55 ▇▅▁▁▁
MedianListingPricePerSqft_2Bedroom 8678 0.34 135.49 74.25 60.00 92.16 121.30 152.24 550.64 ▇▂▁▁▁
MedianListingPricePerSqft_3Bedroom 8605 0.35 129.45 58.84 56.48 93.28 116.63 143.19 460.47 ▇▃▁▁▁
MedianListingPricePerSqft_4Bedroom 8535 0.35 133.46 58.34 61.80 99.60 119.82 144.75 480.77 ▇▂▁▁▁
MedianListingPricePerSqft_5BedroomOrMore 8643 0.35 135.67 64.92 63.78 99.08 119.75 149.35 617.96 ▇▁▁▁▁
MedianListingPricePerSqft_AllHomes 8538 0.35 136.66 68.76 62.14 96.01 120.58 153.44 520.72 ▇▂▁▁▁
MedianListingPricePerSqft_CondoCoop 9063 0.31 163.55 99.42 61.92 113.08 141.00 177.07 1000.00 ▇▁▁▁▁
MedianListingPricePerSqft_DuplexTriplex 9248 0.30 97.09 57.02 32.14 60.89 81.21 113.52 446.43 ▇▂▁▁▁
MedianListingPricePerSqft_SingleFamilyResidence 8573 0.35 133.37 60.79 63.27 95.55 120.09 149.84 475.36 ▇▂▁▁▁
MedianListingPrice_1Bedroom 10205 0.23 147083.00 67877.11 49900.00 99000.00 130000.00 169900.00 399000.00 ▇▇▂▁▁
MedianListingPrice_2Bedroom 8839 0.33 158872.73 80629.40 57000.00 109500.00 147000.00 179900.00 599000.00 ▇▃▁▁▁
MedianListingPrice_3Bedroom 8842 0.33 209225.76 89233.54 109900.00 149000.00 189900.00 240000.00 687000.00 ▇▃▁▁▁
MedianListingPrice_4Bedroom 8876 0.33 310849.76 115574.82 169000.00 238744.50 283020.00 339900.00 950000.00 ▇▃▁▁▁
MedianListingPrice_5BedroomOrMore 8989 0.32 416375.41 191106.68 159900.00 310000.00 369700.00 452172.50 1847500.00 ▇▂▁▁▁
MedianListingPrice_AllHomes 8966 0.32 223378.49 84461.60 112944.00 159900.00 209000.00 259900.00 610000.00 ▇▆▁▁▁
MedianListingPrice_CondoCoop 9402 0.29 202333.42 79067.15 82500.00 152362.50 184900.00 228000.00 754500.00 ▇▃▁▁▁
MedianListingPrice_DuplexTriplex 9323 0.29 207474.74 119710.56 64900.00 129900.00 178900.00 245000.00 939000.00 ▇▂▁▁▁
MedianListingPrice_SingleFamilyResidence 9082 0.31 228169.61 96836.13 112900.00 159900.00 209900.00 265000.00 725000.00 ▇▅▁▁▁
MedianPctOfPriceReduction_AllHomes 8724 0.34 3.85 0.82 1.74 3.24 3.72 4.35 8.34 ▂▇▃▁▁
MedianPctOfPriceReduction_CondoCoop 9340 0.29 4.02 1.06 1.68 3.26 3.81 4.56 10.00 ▅▇▂▁▁
MedianPctOfPriceReduction_SingleFamilyResidence 8724 0.34 3.85 0.82 1.72 3.23 3.74 4.35 8.35 ▂▇▃▁▁
MedianPriceCutDollar_AllHomes 8724 0.34 8034.11 3071.98 5000.00 5100.00 7500.00 10000.00 24000.00 ▇▆▁▁▁
MedianPriceCutDollar_CondoCoop 9340 0.29 7453.13 2890.12 2050.00 5000.00 6800.00 10000.00 27753.50 ▇▆▁▁▁
MedianPriceCutDollar_SingleFamilyResidence 8724 0.34 8244.97 3547.92 5000.00 5300.00 7900.00 10000.00 26000.00 ▇▅▁▁▁
MedianRentalPricePerSqft_1Bedroom 9588 0.27 1.40 0.57 0.72 0.99 1.21 1.61 3.37 ▇▅▂▁▁
MedianRentalPricePerSqft_2Bedroom 9065 0.31 1.11 0.45 0.52 0.81 0.95 1.26 3.11 ▇▅▁▁▁
MedianRentalPricePerSqft_3Bedroom 8985 0.32 0.97 0.32 0.53 0.75 0.86 1.07 2.45 ▇▃▁▁▁
MedianRentalPricePerSqft_4Bedroom 9808 0.26 0.85 0.25 0.49 0.68 0.78 0.94 2.13 ▇▅▁▁▁
MedianRentalPricePerSqft_5BedroomOrMore 11752 0.11 0.76 0.16 0.36 0.63 0.74 0.89 1.27 ▁▇▆▃▁
MedianRentalPricePerSqft_AllHomes 8864 0.33 1.01 0.42 0.58 0.76 0.88 1.12 3.15 ▇▂▁▁▁
MedianRentalPricePerSqft_CondoCoop 10004 0.24 1.30 0.68 0.64 0.90 1.07 1.41 4.82 ▇▁▁▁▁
MedianRentalPricePerSqft_DuplexTriplex 10293 0.22 1.07 0.56 0.50 0.74 0.87 1.12 3.86 ▇▁▁▁▁
MedianRentalPricePerSqft_MultiFamilyResidence5PlusUnits 9189 0.30 1.17 0.50 0.59 0.84 1.00 1.30 3.38 ▇▃▁▁▁
MedianRentalPricePerSqft_SingleFamilyResidence 8923 0.32 0.94 0.28 0.58 0.75 0.86 1.03 2.44 ▇▃▁▁▁
MedianRentalPricePerSqft_Studio 10875 0.18 1.18 0.74 0.57 0.74 0.85 1.26 3.98 ▇▂▁▁▁
MedianRentalPrice_1Bedroom 9686 0.27 978.56 431.03 495.00 650.00 860.00 1195.00 2690.00 ▇▃▁▁▁
MedianRentalPrice_2Bedroom 9168 0.31 1097.62 473.15 575.00 775.00 925.00 1279.00 3215.00 ▇▃▁▁▁
MedianRentalPrice_3Bedroom 9075 0.31 1356.52 465.58 750.00 1050.00 1200.00 1515.00 3550.00 ▇▃▁▁▁
MedianRentalPrice_4Bedroom 9856 0.25 1722.48 486.16 950.00 1395.00 1595.00 1950.00 3850.00 ▇▆▂▁▁
MedianRentalPrice_5BedroomOrMore 11994 0.09 2139.04 698.33 795.00 1750.00 1995.00 2495.00 4500.00 ▂▇▃▁▁
MedianRentalPrice_AllHomes 9060 0.31 1362.29 455.57 750.00 1050.00 1200.00 1590.00 3600.00 ▇▃▁▁▁
MedianRentalPrice_CondoCoop 10437 0.21 1410.25 480.29 697.50 1050.00 1295.00 1595.00 3200.00 ▇▇▂▂▁
MedianRentalPrice_DuplexTriplex 10068 0.24 963.56 434.44 500.00 685.00 800.00 1100.00 2895.00 ▇▂▁▁▁
MedianRentalPrice_MultiFamilyResidence5PlusUnits 9029 0.32 1091.13 449.92 550.00 750.00 950.00 1296.00 2895.00 ▇▃▁▁▁
MedianRentalPrice_SingleFamilyResidence 9120 0.31 1362.84 442.51 750.00 1050.00 1205.00 1582.25 3400.00 ▇▅▁▁▁
MedianRentalPrice_Studio 10211 0.23 1220.91 353.50 490.00 975.00 1150.00 1400.00 2500.00 ▂▇▅▁▁
ZHVIPerSqft_AllHomes 620 0.95 116.42 62.91 35.00 77.00 98.00 141.00 499.00 ▇▃▁▁▁
PctOfHomesDecreasingInValues_AllHomes 4292 0.68 33.29 22.47 0.18 16.98 27.45 45.08 99.38 ▇▇▃▂▁
PctOfHomesIncreasingInValues_AllHomes 4292 0.68 59.03 23.77 0.47 44.22 63.54 76.30 99.76 ▂▃▅▇▅
PctOfHomesSellingForGain_AllHomes 12609 0.05 85.88 11.99 50.99 79.09 89.97 95.63 100.00 ▁▁▂▃▇
PctOfHomesSellingForLoss_AllHomes 12609 0.05 14.12 11.99 0.00 4.36 10.03 20.91 49.01 ▇▃▂▁▁
PctOfListingsWithPriceReductionsSeasAdj_AllHomes 8724 0.34 12.38 2.22 4.51 10.91 12.31 13.92 20.64 ▁▃▇▃▁
PctOfListingsWithPriceReductionsSeasAdj_CondoCoop 9164 0.31 10.39 2.60 2.23 8.53 10.38 12.19 19.61 ▁▅▇▃▁
PctOfListingsWithPriceReductionsSeasAdj_SingleFamilyResidence 8724 0.34 12.60 2.27 4.44 11.09 12.54 14.11 20.76 ▁▃▇▃▁
PctOfListingsWithPriceReductions_AllHomes 8724 0.34 12.36 3.00 3.56 10.23 12.30 14.37 20.92 ▁▅▇▅▁
PctOfListingsWithPriceReductions_CondoCoop 9164 0.31 10.38 3.00 1.79 8.25 10.25 12.38 22.90 ▁▇▇▂▁
PctOfListingsWithPriceReductions_SingleFamilyResidence 8724 0.34 12.58 3.07 3.74 10.42 12.50 14.62 21.69 ▁▅▇▅▁
PriceToRentRatio_AllHomes 8912 0.33 11.44 2.22 7.05 9.73 11.14 12.74 21.55 ▆▇▃▁▁
Sale_Counts 7837 0.41 7065.63 7978.75 130.00 1672.00 4546.00 9247.00 50275.00 ▇▂▁▁▁
Sale_Counts_Seas_Adj 7837 0.41 7049.04 7808.91 242.00 1713.00 4764.00 9393.00 41779.00 ▇▂▁▁▁
Sale_Prices 9218 0.30 194551.78 75755.80 83800.00 137000.00 180900.00 235775.00 543100.00 ▇▆▂▁▁
ZHVI_1bedroom 2607 0.80 117060.42 61056.04 30900.00 74600.00 100400.00 142300.00 390200.00 ▇▆▁▁▁
ZHVI_2bedroom 1467 0.89 135168.80 72267.65 32800.00 86700.00 115400.00 166800.00 542400.00 ▇▅▁▁▁
ZHVI_3bedroom 425 0.97 167062.80 80212.51 49600.00 116400.00 141200.00 204400.00 639700.00 ▇▃▁▁▁
ZHVI_4bedroom 853 0.94 243829.67 106216.76 64700.00 174900.00 218000.00 281000.00 850400.00 ▇▆▁▁▁
ZHVI_5BedroomOrMore 1398 0.89 323733.49 170291.68 68600.00 217900.00 288000.00 365300.00 1497000.00 ▇▂▁▁▁
ZHVI_AllHomes 774 0.94 169753.41 83882.41 38200.00 114500.00 144750.00 207600.00 620400.00 ▇▅▁▁▁
ZHVI_BottomTier 896 0.93 102669.97 49705.35 32600.00 66600.00 87400.00 128200.00 335600.00 ▇▅▂▁▁
ZHVI_CondoCoop 1530 0.88 156769.89 79695.00 42200.00 111300.00 134700.00 175800.00 782900.00 ▇▂▁▁▁
ZHVI_MiddleTier 774 0.94 169753.41 83882.41 38200.00 114500.00 144750.00 207600.00 620400.00 ▇▅▁▁▁
ZHVI_SingleFamilyResidence 774 0.94 174154.38 92243.37 37900.00 115000.00 147300.00 211775.00 737500.00 ▇▃▁▁▁
ZHVI_TopTier 688 0.95 293973.81 147829.06 70900.00 194700.00 251100.00 349400.00 988100.00 ▇▆▂▁▁
ZRI_AllHomes 8958 0.32 1321.33 371.01 799.00 1047.00 1210.00 1474.00 2690.00 ▇▇▂▁▁
ZRI_AllHomesPlusMultifamily 8876 0.33 1318.06 368.78 799.00 1036.00 1210.00 1477.00 2653.00 ▇▇▂▁▁
ZriPerSqft_AllHomes 8876 0.33 0.93 0.28 0.56 0.73 0.86 1.07 2.29 ▇▅▁▁▁
Zri_MultiFamilyResidenceRental 8876 0.33 1233.05 369.43 713.00 959.75 1126.00 1399.50 2606.00 ▇▅▃▁▁
Zri_SingleFamilyResidenceRental 8958 0.32 1327.52 383.31 799.00 1039.25 1220.00 1467.75 2754.00 ▇▆▂▁▁
# --- 1. Create Date Hierarchy ---
# We parse the Date and extract hierarchy levels
processed_data <- all_states_data %>%
  mutate(
    Year = year(Date),
    Quarter = quarter(Date, with_year = TRUE),
    Month = month(Date, label = TRUE)
  )
# --- 2. Create Bins for a Measure ---
# We bin 'MedianListingPrice_AllHomes' into categories
price_breaks <- c(0, 150000, 200000, 250000, 300000, 500000, Inf)
price_labels <- c(
  "Under $150k", "$150k-$200k", "$200k-$250k",
  "$250k-$300k", "$300k-$500k", "Over $500k"
)
processed_data <- processed_data %>%
  mutate(
    Price_Bin = cut(MedianListingPrice_AllHomes,
                    breaks = price_breaks,
                    labels = price_labels,
                    right = FALSE) # Bins are [min, max)
  )
# Show the new hierarchical and binned columns
print("Data with new hierarchy and bins:")
[1] "Data with new hierarchy and bins:"
glimpse(processed_data %>%
  select(RegionName, Date, Year, Quarter, Month, MedianListingPrice_AllHomes, Price_Bin))
Rows: 13,212
Columns: 7
$ RegionName                  <chr> "Alabama", "Arizona", "Arkansas", "Califor…
$ Date                        <date> 1996-04-30, 1996-04-30, 1996-04-30, 1996-…
$ Year                        <dbl> 1996, 1996, 1996, 1996, 1996, 1996, 1996, …
$ Quarter                     <dbl> 1996.2, 1996.2, 1996.2, 1996.2, 1996.2, 19…
$ Month                       <ord> Apr, Apr, Apr, Apr, Apr, Apr, Apr, Apr, Ap…
$ MedianListingPrice_AllHomes <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Price_Bin                   <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

4. Explore ZHVI with Date Hierarchy

For the sake of simplicity, we focus for now on a single state (California) and the Zillow Home Value Index (ZHVI_AllHomes) over time.

# Let's analyze the Zillow Home Value Index (ZHVI_AllHomes) for "California".
# all_states_data
cali_zhvi <- processed_data %>%
  filter(RegionName == "California") %>%
  select(Date, ZHVI_AllHomes) %>%
  # Remove any missing values for this metric
  na.omit()

cali_ts <- cali_zhvi %>%
  as_tsibble(index = Date)
# We set 'Date' as the 'index' (the time component).
cali_ts_fill_gaps <- cali_zhvi %>%
  as_tsibble(index = Date) %>%
  fill_gaps()
# If you were analyzing multiple states, you would add a 'key'.
# Example for multiple states (not run here):
# multi_state_ts <- all_states_data %>%
#   select(Date, RegionName, ZHVI_AllHomes) %>%
#   as_tsibble(index = Date, key = RegionName)

Let's see how many NAs remain after filling the gaps:

# Count total NAs per column
colSums(is.na(cali_ts))
colSums(is.na(cali_ts_fill_gaps))
# Inspect the new, focused data frame
head(cali_zhvi)
        Date ZHVI_AllHomes
1 1996-04-30        157900
2 1996-05-31        157800
3 1996-06-30        157500
4 1996-07-31        157300
5 1996-08-31        157000
6 1996-09-30        156800
head(cali_ts)
# A tsibble: 6 x 2 [1D]
  Date       ZHVI_AllHomes
  <date>             <int>
1 1996-04-30        157900
2 1996-05-31        157800
3 1996-06-30        157500
4 1996-07-31        157300
5 1996-08-31        157000
6 1996-09-30        156800
head(cali_ts_fill_gaps)
# A tsibble: 6 x 2 [1D]
  Date       ZHVI_AllHomes
  <date>             <int>
1 1996-04-30        157900
2 1996-05-01            NA
3 1996-05-02            NA
4 1996-05-03            NA
5 1996-05-04            NA
6 1996-05-05            NA
cali_ts %>%
  autoplot(ZHVI_AllHomes) +
  labs(title = "Zillow Home Value Index (ZHVI) for California",
       y = "Home Value Index",
       x = "Year") +
  theme_minimal()

# This will scan cali_ts and report any gaps
gap_summary <- count_gaps(cali_ts)

# Print the summary
print(gap_summary)
# A tibble: 260 × 3
   .from      .to           .n
   <date>     <date>     <int>
 1 1996-05-01 1996-05-30    30
 2 1996-06-01 1996-06-29    29
 3 1996-07-01 1996-07-30    30
 4 1996-08-01 1996-08-30    30
 5 1996-09-01 1996-09-29    29
 6 1996-10-01 1996-10-30    30
 7 1996-11-01 1996-11-29    29
 8 1996-12-01 1996-12-30    30
 9 1997-01-01 1997-01-30    30
10 1997-02-01 1997-02-27    27
# ℹ 250 more rows
cali_ts_fill_gaps %>%
  autoplot(ZHVI_AllHomes) +
  labs(title = "Zillow Home Value Index (ZHVI) for California",
       y = "Home Value Index",
       x = "Year") +
  theme_minimal()

# Inspect the new, focused data frame
# print("cali_zhvi (data frame):")
# DT::datatable(cali_zhvi, options = list(pageLength = 5))

print("cali_ts (tsibble):")
[1] "cali_ts (tsibble):"
DT::datatable(cali_ts, options = list(pageLength = 5))
# print("cali_ts_fill_gaps (tsibble with filled gaps):")
# DT::datatable(cali_ts_fill_gaps, options = list(pageLength = 5))
  1. Create a new ‘Month’ column using the yearmonth() function
  2. Group by this new explicit month
  3. Summarise the data (using mean() is safe, but since you have one observation per month, last() or sum() would also work)
  4. Convert to a tsibble, now indexed by the new ‘Month’ object
cali_ts_monthly <- cali_zhvi %>%
  mutate(Month = yearmonth(Date)) %>%
  group_by(Month) %>%
  summarise(ZHVI_AllHomes = mean(ZHVI_AllHomes, na.rm = TRUE)) %>%
  as_tsibble(index = Month)

# Check the new object
print("New Monthly Tsibble:")
[1] "New Monthly Tsibble:"
head(cali_ts_monthly)
# A tsibble: 6 x 2 [1M]
     Month ZHVI_AllHomes
     <mth>         <dbl>
1 1996 Apr        157900
2 1996 May        157800
3 1996 Jun        157500
4 1996 Jul        157300
5 1996 Aug        157000
6 1996 Sep        156800
# Note the <mth> tag in the output, indicating it's a monthly tsibble
print("Class of the new index:")
[1] "Class of the new index:"
class(cali_ts_monthly$Month)
[1] "yearmonth"  "vctrs_vctr"
cali_ts_monthly %>%
  autoplot(ZHVI_AllHomes) +
  labs(title = "Zillow Home Value Index (ZHVI) for California",
       y = "Home Value Index",
       x = "Year") +
  theme_minimal()

In R, Date objects are stored internally as the number of days that have passed since an “origin” date, which by default is January 1, 1970.

The number 9616 is the number of days since 1970-01-01.

This corresponds to the date April 30, 1996, which is the starting point of our Zillow dataset.

This happens if you (or R) accidentally convert a Date object to a plain number.

The yearmonth object we created in the last step is different. It prints as “1996 Apr” and internally stores the number of months since the 1970 epoch (a much smaller number: 315).
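Both encodings can be checked directly in base R; the month arithmetic below mirrors what yearmonth stores, assuming its epoch is January 1970:

```r
# Days-since-epoch encoding of a Date object
d <- as.Date("1996-04-30")
as.numeric(d)                  # -> 9616 days since 1970-01-01

# Months-since-epoch encoding, as used internally by yearmonth:
# 26 full years of 12 months, plus 3 months (Jan-Mar 1996)
(1996 - 1970) * 12 + (4 - 1)   # -> 315
```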

print("cali_ts_monthly (tsibble with only monthly):")
[1] "cali_ts_monthly (tsibble with only monthly):"
DT::datatable(cali_ts_monthly, options = list(pageLength = 5))

Reflecting on how the ZHVI is calculated and how it compares to other home value indices, such as Case–Shiller:

The Standard & Poor’s CoreLogic Case–Shiller Home Price Indices are repeat-sales house price indices for the United States.
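Repeat-sales indices measure appreciation on like-for-like properties by pairing two sales of the same home. A heavily simplified sketch of the idea on invented data (not the actual S&P/Case–Shiller methodology, which uses a weighted regression over many sale pairs):

```r
# Invented repeat-sales pairs: the same home sold at two points in time
sales <- data.frame(
  home_id     = c("A", "B", "C"),
  first_price = c(200000, 300000, 150000),
  later_price = c(220000, 315000, 165000)
)

# Price relative for each home, then a simple (unweighted) index level
sales$relative <- sales$later_price / sales$first_price
index_level <- 100 * mean(sales$relative)  # base period = 100
index_level  # -> 108.3333
```

Because each relative compares a home to itself, the index is immune to the compositional shifts that can distort a simple median of sale prices.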

The FipeZAP Index is a comparable index for the Brazilian residential market, based on asking prices from real estate listings.

Conceptual Accuracy (What it gets right)

The R code correctly captures the concept of measuring the middle-third of the market:

Focus on the Middle Tier: The core idea of calculating the ZHVI is to use an aggressively trimmed mean—specifically the mean of values between the 35th and 65th percentile. The code successfully identifies these cut-off points (\(\mathbf{q35}\) and \(\mathbf{q65}\)) and calculates the mean of the values within that range. This properly simulates the “robust” nature of the ZHVI, which aims to exclude extreme high-end and low-end sales that might skew a simple median or average.

Major Flaws (What it gets wrong)

  1. The Data Source is Incorrect (Sales vs. All Homes)
  • Your Code’s Flaw: The input data, hypothetical_sales, is a list of sales transactions.

  • Actual ZHVI Method: The ZHVI is calculated using the Zestimate for every single home (over 100 million properties), not just the few that sold in a given month.

    • Impact: Indices based on sales (like Case-Shiller) are susceptible to the compositional bias of which homes happen to sell that month. The ZHVI avoids this by including the estimated value (Zestimate) for all homes, providing a much more comprehensive view of the entire market’s value, whether a house sold or not.
  2. It Calculates a Static Value, Not Appreciation
  • Your Code’s Flaw: The output, simulated_zhvi, is a single-month dollar value.

  • Actual ZHVI Method: The ZHVI is an index that tracks the change in home values over time (appreciation). It is a chained index where the monthly value is determined by:

    1. Calculating the weighted mean appreciation of individual Zestimates from the prior month to the current month.

    2. Applying that appreciation rate to the ZHVI level from the prior month.

    • Impact: To calculate a correct ZHVI, you would need two separate sets of Zestimates (Time \(t-1\) and Time \(t\)), calculate the percentage change for each home, and then average those changes to determine the appreciation factor for the overall index. The code calculates a robust mean, but it doesn’t show how that mean changes month-over-month to create the index.

Hypothetical Fix for Flaw 2 (Conceptual only)

A conceptually more accurate approach would require comparing two months:

  1. Month 1 Zestimates (t-1): Calculate the \(\text{Mean}_{35-65}\) of Zestimates for Month 1.

  2. Month 2 Zestimates (t): Calculate the \(\text{Mean}_{35-65}\) of Zestimates for Month 2.

  3. Appreciation:

\[\text{Monthly Appreciation} = \frac{\text{Mean}_{35-65}(t)}{\text{Mean}_{35-65}(t-1)} - 1\]

  4. ZHVI Calculation: \(\text{ZHVI}(t) = \text{ZHVI}(t-1) \times (1 + \text{Monthly Appreciation})\)
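A minimal sketch of that chained calculation, extending the simulation code below to two months (both Zestimate vectors here are invented, with roughly 0.5% monthly growth baked in):

```r
set.seed(42)

# Invented Zestimates for the SAME 100 homes in two consecutive months
zest_t0 <- rnorm(100, mean = 350000, sd = 75000)
zest_t1 <- zest_t0 * rnorm(100, mean = 1.005, sd = 0.002)

# Mean of the 35th-65th percentile band, as in the simulation below
mid_mean <- function(x) {
  q <- quantile(x, c(0.35, 0.65))
  mean(x[x >= q[[1]] & x <= q[[2]]])
}

# Appreciation of the middle tier, chained onto the prior index level
appreciation <- mid_mean(zest_t1) / mid_mean(zest_t0) - 1
zhvi_t0 <- mid_mean(zest_t0)  # start the index at month 1's level
zhvi_t1 <- zhvi_t0 * (1 + appreciation)
```

Repeating the last three lines month over month yields a chained series rather than a single static value, which addresses Flaw 2.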
# Note: as discussed above, this simulation captures the 35th-65th
# percentile trimming but not the chained-index calculation.

# 1. Create a hypothetical set of 100 home sales.
# We'll use random numbers for this example.
set.seed(42) # Makes our "random" numbers reproducible
hypothetical_sales <- round(rnorm(100, mean = 350000, sd = 75000))

# 2. Find the 35th and 65th percentile values
# These are the "cut-off" points.
q35 <- quantile(hypothetical_sales, 0.35)
q65 <- quantile(hypothetical_sales, 0.65)

# 3. Filter to get only the "middle-tier" homes
# (i.e., homes with a value between the 35th and 65th percentile)
middle_tier_homes <- hypothetical_sales[
  hypothetical_sales >= q35 & hypothetical_sales <= q65
]

# 4. Calculate the "Simulated ZHVI"
# This is the mean of only those middle-tier homes.
simulated_zhvi <- mean(middle_tier_homes)

# --- Print the results ---
print(paste("Total number of hypothetical sales:", length(hypothetical_sales)))
[1] "Total number of hypothetical sales: 100"
print(paste("35th Percentile Value:", q35))
[1] "35th Percentile Value: 328947.5"
print(paste("65th Percentile Value:", q65))
[1] "65th Percentile Value: 389868.05"
print(paste("Number of homes in middle-tier (35th-65th percentile):", length(middle_tier_homes)))
[1] "Number of homes in middle-tier (35th-65th percentile): 30"
print(paste("Simulated ZHVI (Mean of middle-tier):", round(simulated_zhvi, 2)))
[1] "Simulated ZHVI (Mean of middle-tier): 359053.03"

The ZHVI is most likely calculated from the same underlying sales data that generated MedianListingPrice_AllHomes and the related listing-price columns:

MedianListingPrice_1Bedroom

MedianListingPrice_2Bedroom

MedianListingPrice_3Bedroom

MedianListingPrice_4Bedroom

MedianListingPrice_5BedroomOrMore

MedianListingPrice_AllHomes

MedianListingPrice_CondoCoop

MedianListingPrice_DuplexTriplex

MedianListingPrice_SingleFamilyResidence
